1. Brief overview of scientific achievements:

The  research  carried  out  under  this  contract  focussed  on  four  efforts,  all 
involving  the  processing  of  temporal  sequences  by  neural  networks  (1-3)  or  the  effect 
of  imposing  a  spatio-temporal  gradient  on  network  learning  (4): 

(1)  Assessing  alternative  neural  network  techniques  for  problems  involving 
temporal  coding. 

(2) Development of tools for analyzing recurrent networks, so that the solutions
of successfully trained networks can be better understood.

(3)  Development  of  a  dynamical  systems  theory  approach  to  computation  in 
recurrent  networks. 

(4)  Development  of  biologically  and  cognitively  plausible  techniques  for 
enhancing  training. 

Work in the first area was carried out with Thomas Rebotier (a doctoral student
in the Cognitive Science department at UCSD). First, we constructed a suite of
benchmark problems, including economic indices, seismic data, strings generated from
various classes of grammars, speech data, and acoustic data. Second, we developed
local implementations of Time Delay Neural Networks, Hidden Markov Models,
Backpropagation Through Time, and Simple Recurrent Networks. [This work was
reported in detail in earlier progress reports and will not be summarized here.]

Work  in  the  second  and  third  areas  was  done  in  collaboration  with  Rebotier, 
Paul  Rodriguez  (another  doctoral  student  in  Cognitive  Science)  and  Janet  Wiles 
(University  of  Queensland).  We  used  various  techniques  for  analyzing  the  movement 
of  networks’  internal  state  vectors  through  state  space,  over  time;  the  goal  is  to 
understand  how  the  networks  make  use  of  state  space  and  temporal  dynamics  to 
encode  temporal  information. 

Work in the fourth area was done with Thomas Rebotier, in collaboration
with Mark Johnson (University College London) and Jeff Shrager (University of
Pittsburgh). This work concerned the development of biologically and cognitively
plausible techniques for enhancing training. The goals were two-fold: (i) to account
for the spatial differentiation, over time, of initially multipotent embryonic cortex;
and (ii) to extend the computational capacities of Hebbian learning by imposing a
spatio-temporal gradient on the learning process.


2.  Summary  of  results 


2.1  Benchmarks. 

This  work  was  reported  in  detail  in  earlier  progress  reports  and  will  not  be 
summarized  here. 

2.2  Analytic  tools. 

A common complaint regarding neural networks (when they are offered as
models of biological processes) is “what value is there in replacing one black box (e.g.,
the brain) with another black box (e.g., a neural network that emulates some brain-like
capability)?” The complaint is reasonable, since much of the value of a model
presumably lies in our ability to probe it to a greater extent than is possible with
biological systems (for which invasive procedures are limited and highly regulated, and
for which non-invasive tests are less precise). However, the complaint is now somewhat
dated, since most network researchers recognize that network analysis plays a crucial
role in validating the models they construct.

Under the current contract, a number of novel network analyses have been
developed and utilized. Many of these techniques are not novel in other fields, but
their application to neural network research is (and in many cases, the research
supported by the current contract was the first to utilize them). They include
principal components analysis, projection pursuit, multidimensional scaling, and
contribution analysis. In the relatively short period since the beginning of this
contract (1992), many of these tools have become standard in the field.
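
As an illustration of the first of these tools, the following is a minimal sketch of how the leading principal component of a set of hidden-unit activation vectors might be extracted. It is not the code used under the contract: the function name, the power-iteration method, and the toy data are assumptions chosen to keep the example self-contained.

```python
def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the leading principal component.

    `rows` is a list of equal-length activation vectors, one per time step.
    Power iteration is used here only for simplicity (an assumption of this
    sketch); it presumes the data have nonzero variance and that the starting
    vector is not orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix of the centered activations.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]

    # Project each centered activation vector onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy "hidden state trajectory" with two units that co-vary over time.
states = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]]
direction, scores = first_principal_component(states)
```

Plotting the scores over time (or the first two or three components against each other) gives a low-dimensional view of how the state vector moves through state space.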

2.3 Development of a dynamical systems account of computation in recurrent
neural networks.

Still relatively little is known about the computational properties of recurrent
networks. Despite early proofs about Turing capability (J. Pollack’s thesis), and more
recent important work by Siegelmann and Sontag, we still lack the kind of analysis
for recurrent networks which the Chomsky hierarchy provides for discrete automata.
The Chomsky hierarchy maps machines onto grammars, and indicates the computational
benefits which result from extending machine resources in a principled way.

Recently, as part of the work supported by this contract, Janet Wiles, Paul
Rodriguez, and I have undertaken a series of studies whose goal is to understand
how traditional formal languages (as classified by the Chomsky hierarchy)
might be processed by recurrent networks. Ultimately, we hope to understand what
the natural classes of languages are (because conceivably the natural class of
languages processed by recurrent networks may not be commensurate with the class of
languages defined under the Chomsky hierarchy). In the short term, our focus is to
understand how recurrent networks carry out computation; our specific perspective
has been to study this using dynamical systems analysis.

In initial studies, we found that a recurrent network could process a context-free
language (a^n b^n; i.e., some number of a’s followed by an equal number of b’s) by
setting up two dynamical regimes. In the first regime (when the network is in
“counting up” mode, receiving a’s), the network has an attracting fixed point (with
oscillatory behavior); this is shown in Figure 1a. In the second regime (when the
network is in “counting down” mode, receiving b’s), the network has a repelling fixed
point (again, with oscillatory behavior); this is shown in Figure 1b. The network
guarantees that it will recognize the end of the string when the final b is input by
satisfying two conditions. First, the rate of contraction toward the attracting fixed
point is precisely matched to the rate of expansion away from the repelling fixed
point. Second, in the transition from the last a to the first b, the network moves to
a distance from the second fixed point which is matched to its distance from the first
fixed point. In more recent studies, we have extended this work to more complex
languages, including the palindrome language xx^R (a string of inputs, x, followed by
x^R, the string in reversed form). This language is interesting because it resembles
the center-embedded relative clauses found in natural language (e.g., “The book that
the girl read is missing.”).
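
The counting mechanism described above can be illustrated with a hand-built, one-dimensional sketch. This is not the trained network: the single state variable, the contraction rate of 0.5, and the function name are assumptions made for illustration, and the oscillatory behavior seen in the trained networks is not modeled. The state contracts toward a fixed point at 0 while a’s are read and expands away from it at the matched rate while b’s are read, so it returns to its starting value exactly when the counts are equal.

```python
def accepts_anbn(string, rate=0.5):
    """Accept strings of the form a^n b^n (n >= 1) using matched
    contraction and expansion of one state variable (illustrative only)."""
    if not string or string[0] != "a":
        return False
    z = 1.0            # distance from the fixed point at 0
    seen_b = False
    for symbol in string:
        if symbol == "a":
            if seen_b:             # an 'a' after a 'b' is ungrammatical
                return False
            z *= rate              # "counting up": contract toward 0
        elif symbol == "b":
            seen_b = True
            z /= rate              # "counting down": expand away from 0
            if z > 1.0 + 1e-9:     # more b's than a's
                return False
        else:
            return False
    # The end of the string is recognized when the state is back at its start.
    return seen_b and abs(z - 1.0) < 1e-9
```

For example, `accepts_anbn("aaabbb")` holds while `accepts_anbn("aabbb")` does not. If the expansion rate were not the exact inverse of the contraction rate, the state would drift and the count would be lost for longer strings.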


Figure 1. Vector flow fields for two dynamical regimes of a network trained on the
a^n b^n language. (a) Flow field while a’s are input; (b) flow field while b’s are input.


2.4  Interactions  between  learning  and  timing. 

Many  of  the  neural  models  for  temporal  processing  which  have  been  studied  to 
date  suffer  from  scaling  problems.  The  models  work  well  with  restricted  data  or  on 
toy  problems.  Attempts  to  scale  up  to  larger  data  sets  or  to  time  series  in  which  the 
temporal  relationships  are  more  complex  often  do  not  work  well.  This  problem  of 
scaling  is  of  course  not  unique  to  neural  network  models;  the  failure  to  scale  is  a 
chronic  problem  of  many  models. 

I have recently become interested in the possibility that the developmental
trajectory which humans undergo may interact with the learning of complex
behaviors. An inordinately long portion of the human life cycle is spent in a period
of immaturity; given the vulnerability of the immature state, this would seem to be
evolutionarily maladaptive. On the other hand, there may be positive consequences to
delayed development. My hypothesis is that certain problems are in fact best
learned by “starting small” (i.e., with limited resources). I have been studying this
possibility in two realms.

(a) In the first, I attempted to train a simple recurrent network to process
strings generated by a context-free grammar. (This is a category of formal languages
into which human languages are minimally classified; human languages may in fact
be somewhat more complex.) Although humans appear able to learn such grammars,
recurrent networks consistently failed, across a wide range of training conditions.
However, when the networks were trained in an incremental fashion, they
succeeded in learning the data sets. Incremental training was carried out in either
of two ways (both worked equally well). In the first regime, networks were trained
on a subset of strings which were shorter in duration and which contained no
embeddings. After mastering this simpler data set, the networks were given
increasingly complex data. In the second regime, networks were trained from the
beginning on the final complex data set. However, noise was injected every two or
three tokens during early portions of training. This noise effectively interfered with
the learning of the more complex data. As training progressed, the interval between
noise injections was lengthened in two- or three-word increments, and the noise was
eventually eliminated. In both regimes, learning was rapid and generalization was
high. The technique is similar to a hypothesis proposed by Newport for children. The
assumption is that early resource limitations force the networks to focus on the major
sources of variance, and that this provides a necessary scaffolding for the networks
to learn the more complex interactions exhibited by longer sentences. Thus, for the
networks to achieve the final “adult” competence, they must go through a
maturational period which resembles that of children.
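
The two incremental regimes described above can be sketched as training schedules. The code below is an illustration under stated assumptions rather than the procedure actually used: the function names, the `depth_of` helper, the specific interval values, and the choice to overwrite the whole state with uniform random values are all hypothetical.

```python
import random


def curriculum_stages(strings, depth_of, depth_limits=(0, 1, 2)):
    """Regime 1: yield growing training subsets, simplest strings first.

    `depth_of` is a hypothetical helper returning the embedding depth of a
    string; each stage admits strings up to the next depth limit.
    """
    for limit in depth_limits:
        yield [s for s in strings if depth_of(s) <= limit]


def noise_schedule(first_interval=3, increment=2, n_stages=4):
    """Regime 2: noise interval (in tokens) for each training stage.

    The interval grows by `increment` per stage; the final None entry
    means that noise has been eliminated.
    """
    return [first_interval + k * increment for k in range(n_stages)] + [None]


def inject_noise(state, token_index, interval, rng):
    """Overwrite the network state with random values every `interval` tokens."""
    if interval is not None and token_index > 0 and token_index % interval == 0:
        return [rng.random() for _ in state]
    return state


# Toy usage: embedding depth approximated by counting the word "who".
toy_strings = [
    "boy runs",
    "boy who sees girl runs",
    "boy who sees girl who likes dog runs",
]
stages = list(curriculum_stages(toy_strings, lambda s: s.split().count("who")))
intervals = noise_schedule()
```

In the first regime the data change while the network stays the same; in the second the data stay the same while the network’s effective memory span grows. Both restrict what can be learned early in training.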

Figure  2a  shows  the  internal  state  space  of  a  network  which  was  trained  in 
the  non-incremental  fashion;  the  state  space  is  relatively  unstructured  and  fails  to 
encode  temporally  significant  information.  Figure  2b  shows  the  internal  state  space 
of  a  network  which  was  trained  in  the  incremental  manner  described  above.  The 
state  space  is  well-structured  and  encodes  grammatically  relevant  information. 

(b) In the second series of experiments (done in collaboration with Jeff
Shrager and Mark Johnson) I have been interested in the possible computational
benefits of another developmental pattern, namely the fact that neo-natal human
cortex undergoes waves of synaptic proliferation followed by synaptic pruning. These
waves do not occur everywhere simultaneously. Instead, they last over a period of
several years and pass over different spatial regions of cortex at different points in
time.

Figure 2. Plot of hidden unit activation patterns (in response to presentation of 10,000 word
inputs), shown in coordinates of the first 3 principal components. (a) Network which failed in the
task. (b) Network which succeeded in the task, after being trained with an incremental regime.

The initial series of experiments, carried out by Kerszberg, Dehaene, and
Changeux and replicated by Shrager, Johnson, and myself, involved a slab of
‘pseudo-cortex’, shown in Figure 3. This slab consisted of a 30x30 matrix of nodes.
Each node received random inputs from neighbors following a Gaussian probability
distribution, such that connections from near neighbors were more probable than
connections from distal units. In addition, each node received input from two afferents
(marked A and B), which fired randomly and simultaneously. The connection matrix for
the slab was modified according to a Hebbian learning rule. After learning, we asked
of each node what function of the two afferents it was computing. Most nodes
remained off, but some fired whenever A was on; others fired whenever B was on;
others became AND units, etc. This was the expected result.
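
The distance-dependent wiring of the slab can be sketched as follows. The report does not give the width of the Gaussian or the exact sampling procedure, so the `sigma` value, the independent per-pair sampling, and the function names are assumptions of this sketch.

```python
import math
import random


def connection_probability(src, dst, sigma=2.0):
    """Gaussian fall-off of connection probability with grid distance.

    `sigma` (in grid units) is an assumed value, not taken from the report.
    """
    dr, dc = src[0] - dst[0], src[1] - dst[1]
    return math.exp(-(dr * dr + dc * dc) / (2.0 * sigma * sigma))


def sample_inputs(size=30, sigma=2.0, seed=0):
    """For each node of a size x size slab, sample which nodes project to it."""
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(size) for c in range(size)]
    inputs = {dst: [] for dst in nodes}
    for dst in nodes:
        for src in nodes:
            # No self-connections; near neighbors are accepted more often.
            if src != dst and rng.random() < connection_probability(src, dst, sigma):
                inputs[dst].append(src)
    return inputs


# A small slab keeps the example fast; size=30 matches the slab in the text.
small_slab = sample_inputs(size=6, seed=1)
```

With a narrow Gaussian, almost all sampled inputs come from within a few grid steps, which is what allows a wave moving across columns to separate early learners from late learners spatially.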

However, a different result is obtained when learning progresses in a staged
manner, modeling the movement over time of a “trophic factor” (TF) through the
matrix, such that columns underneath the TF wave are more plastic while those
elsewhere decay or do not learn. We have found that if the TF wave moves from
left to right, so that the left-most columns are early learners and the right-most
columns are late learners, then the units on the left develop as in the first condition.
However, a significant number of units in the late learning columns become XOR
units. This is a surprising result given the known problem with Hebbian learning
and non-correlated input patterns; it results from the fact that the TF wave allows
early learning units to develop which become sensitive to OR and AND functions.
The late learning units then take as their input not only the external afferents but
also the outputs of these OR and AND units, and that makes it possible for them to
learn XOR. This result is promising because it demonstrates that a learning rule of
known biological plausibility but known computational limitation may be “salvaged”
by subjecting learning to a maturational regime which is itself plausible.
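
Why access to OR and AND units makes XOR reachable can be shown with simple threshold units. The weights and thresholds below are set by hand for illustration; they are assumptions of this sketch and were not produced by the Hebbian rule or taken from the simulations.

```python
def threshold_unit(inputs, weights, threshold):
    """Fire (1) when the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def early_units(a, b):
    """Early-learning units: one sensitive to OR, one to AND."""
    or_out = threshold_unit([a, b], [1.0, 1.0], 0.5)
    and_out = threshold_unit([a, b], [1.0, 1.0], 1.5)
    return or_out, and_out


def late_unit(a, b):
    """Late-learning unit: sees the afferents and the early units' outputs.

    It is excited by the OR unit and inhibited by the AND unit; the direct
    afferent weights are zero in this hand-built example.
    """
    or_out, and_out = early_units(a, b)
    return threshold_unit([a, b, or_out, and_out], [0.0, 0.0, 1.0, -1.0], 0.5)


# Truth table of the late unit over the two afferents.
table = {(a, b): late_unit(a, b) for a in (0, 1) for b in (0, 1)}
```

No single threshold unit driven by A and B alone can produce this truth table, because XOR is not linearly separable; the intermediate OR and AND outputs supply the extra dimensions that make it so.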

The work to date involves only temporally static stimuli. We are now trying to
extend this finding to conditions in which stimuli are more complex and which
involve temporal dependencies.


Figure 3. Slab of ‘pseudo-cortex’ (label in figure: local interconnections).
Mozer (Ed.), Advances in Neural Information Processing Systems 7. San
Mateo,  CA:  Morgan  Kaufmann. 

Wiles, J., Rodriguez, P., & Elman, J.L. (1995). Learning to count without a counter: A
(c) Invited presentations:

April, 1993: Duke University: Workshop on Temporal Coding in Biological Systems.
Invited talk.

May, 1993: University of Wisconsin: Society for Research in Child Language Disorders.
Keynote Address.

October,  1993:  University  of  California,  Irvine.  Cognitive  Science  Department. 
Invited  talk. 

February, 1994: Washington University, St. Louis: Program in Philosophy, Neuroscience,
and Psychology. Invited talk.

March,  1994:  Ohio  State  University:  Cognitive  Science  Program.  Invited  talk. 

April,  1994:  Rice  University:  Cognitive  Science  Program.  Invited  talk.  Linguistics 
Department.  Invited  talk. 

April,  1994:  California  State  University,  Fullerton.  Cognitive  Science  Program. 
Invited  talk. 

April, 1994: City University of New York: Linguistics, Cognitive Science, and Childhood
Language Disorders. Tutorial and invited talk.

December, 1994: University of San Marino, San Marino. Symposium on Rethinking
Innateness. Invited talk.

April,  1996:  International  Conference  on  Infant  Studies.  Invited  talk. 

May, 1996: Santa Fe Institute: Workshop on Dynamical Models of Cognition. Invited
talk.

May,  1996:  Society  for  Philosophy  and  Psychology.  Invited  talk. 

September,  1996:  Sussex  University,  England.  Invited  talk. 

September,  1996:  University  of  York,  England.  Invited  talk. 

October, 1996: University of Oxford, England. McDonnell-Pew Center for Cognitive
Neuroscience. Invited talk.

October,  1996:  University  of  Colorado  at  Boulder.  Invited  talk. 

November,  1996:  University  of  Texas  at  Austin.  Invited  talk. 

February,  1997:  University  of  Chicago.  Invited  talk. 

March, 1997: UCSD/Salk Center for Cognitive Neuroscience Retreat. Invited talk.

March,  1997: 10th  Annual  CUNY  Conference  on  Sentence  Processing  (Los  Angeles). 
Invited  talk. 


May,  1997:  Carnegie  Symposium,  Carnegie  Mellon  University.  Invited  talk. 

August,  1997:  Computational  Psycholinguistics  Conference  (Stanford).  Invited  talk. 

August,  1997:  University  of  Texas  at  Austin.  Department  of  Psychology.  Invited  talk. 

August,  1997:  Computational  Psycholinguistics  Workshop  (Berkeley).  Invited  talk. 

November,  1997:  McGill  University.  Cognitive  Science  Program.  Hebb  Distinguished 
Speaker  Address. 

February,  1998:  Cornell  University  Cognitive  Science  Program.  Invited  talk. 

May  4:  University  of  California,  Irvine.  Department  of  Cognitive  Science.  Invited 
talk. 

May  19:  Hunter  College,  CCNY.  Department  of  Psychology.  Invited  talk. 


(d)  student  training: 
graduate  students: 

Arshavir  Blackwell  (Psychology/Cognitive  Science,  UCSD) 
Jay  Moody  (Cognitive  Science,  UCSD) 

Thomas  Rebotier  (Cognitive  Science,  UCSD) 

Paul  Rodriguez  (Cognitive  Science,  UCSD) 

Jill  Weckerly  (Cognitive  Science/Linguistics,  UCSD) 


postgraduate  scholars/visitors: 

Mary  Hare  (CRL,  UCSD) 

Thomas  Shultz  (Psychology,  McGill  Univ.) 

Janet  Wiles  (Computer  Science,  Univ.  of  Queensland) 


